Introduction to DataStage

DataStage Overview

DataStage is a comprehensive ETL tool that provides end-to-end ERP solutions.

Some of the most popular ETL tools are:

  • DSPX (DataStage PX): a leader among ETL tools since 2006
  • Informatica
  • ODI (Oracle Data Integrator)
  • SAS (ETL Studio)
  • BODI (Business Objects Data Integrator)
  • Ab Initio

History of DataStage

DataStage has more than 12 years of history.

  • The first release was in 1997, by VMARK (UK).
  • Mr. Lee Scheffler is regarded as the father of DataStage.
  • During 1997, DataStage was called Data Integrator → Torrent (Data Integrator).

IBM acquired Informix, with its database, in 2000.

  • 2000: Ascential DataStage Server Edition
    • It is the combination of Informix + Data Integrator.
  • 2000: Ascential DataStage Server + Orchestrate
    • Orchestrate is an ETL tool with extensive parallel capabilities.
    • It is executed only on UNIX flavors.

UNIX flavors

  • AIX
  • Linux
  • HP-UX
  • Sun Solaris

Through the combination with Orchestrate, DataStage acquired parallel capabilities.

  • Version 6
  • Versions 6 – 7.5.1
  • Parallel environment

 

Ascential DataStage PX (Parallel Extender)

  • It could be configured only on UNIX flavors.
  • Up to version 7.5.1, the server components were configured only on UNIX flavors.

December 2004

  • 7.5 * 2: ASDSPX + MKS Toolkit
    • ASDSPX: Ascential DataStage PX
    • MKS Toolkit: creates a virtual, UNIX-like environment in Windows XP to run DataStage
  • So the MKS Toolkit made it possible to run DataStage on Windows.
  • 7.5 * 2 (ASDSPX + MKS Toolkit) can perform only data transformation.

MKS Toolkit → Ascential Suite components

The suite was released with the following components as software:

(a) ProfileStage
(b) QualityStage
(c) AuditStage
(d) MetaStage
(e) DataStage PX
(f) DataStage TX

2005

  • IBM acquired all of Ascential → IBM DataStage Enterprise Edition 7.5 * 2 (used by 50% of users)

2006

  • IBM WebSphere DataStage & QualityStage 8.0.1 → an integrated environment (IDE), used by 40% of users

The integrated environment combines:

(a) ProfileStage
(b) QualityStage
(c) AuditStage
(d) MetaStage
(e) DataStage PX

2009

  • IBM InfoSphere DataStage & QualityStage 8.0.1 → improved web services and a changed server (used by 10% of users)

Features of Data Stage

  1. Any to Any
  2. Platform Independent
  3. Node Configuration
  4. Partition Parallelism
  5. Pipeline Parallelism

Any to Any

DataStage reads data from any source and loads it into any target.

Any source    ↔   Any target  

Platform Independent

A job designed on one operating system can be executed on another.

  • A platform can generally be either software or hardware.
  • In DataStage, "platform" refers to the hardware.

Hardware environment

  • Uniprocessing environment: a single CPU (hard disk → CPU → RAM).
  • Symmetric multiprocessing (SMP): a single hard disk with multiple CPUs; a system can have 32–64 CPUs.
  • Massively parallel processing (MPP): a collection of different SMPs.

  Node Configuration

  • The best feature of DataStage.
  • It is a technique for creating logical CPUs.
  • Node → a logical CPU, or an instance of a (physical) CPU.
  • It is software that creates virtual CPUs.
  • DataStage is executed on logical CPUs.
  • To run a job in DataStage, we require at least 1 node.

Example (ETL):

  • Uniprocessing (hard disk → CPU → RAM): accessing 1000 records takes 10 minutes.
  • SMP: accessing 1000 records with 4 CPUs takes 2.5 minutes.
  • Node configuration: uniprocessing → virtual SMP.

The OS does not use the maximum capabilities of the CPU, so node configuration is software that divides the CPU into different nodes. This boosts the capabilities of the CPU.
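The timing figures in the example above come down to simple arithmetic. Here is a minimal Python sketch, assuming ideal linear scaling across nodes (real jobs rarely scale this cleanly). The helper name `estimated_minutes` is hypothetical and is not part of DataStage.

```python
def estimated_minutes(single_node_minutes: float, nodes: int) -> float:
    """Estimate run time on `nodes` logical CPUs, assuming ideal linear scaling."""
    if nodes < 1:
        # The article notes that a job needs at least 1 node to run.
        raise ValueError("a job requires at least 1 node")
    return single_node_minutes / nodes


# Figures from the example: 1000 records take 10 minutes on one CPU.
print(estimated_minutes(10, 1))  # 10.0
print(estimated_minutes(10, 4))  # 2.5
```

In practice the gain depends on how evenly the records are spread across the nodes, which is where partitioning comes in.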

Partition Parallelism

  • Horizontal combining: combining primary rows with secondary rows with respect to key column values.

Partitioning

Partitioning is a technique for distributing records across the nodes, based on a partitioning technique.

Partitioning Techniques

  • Key-based: Hash, Modulus, Range, DB2
  • Keyless: Same, Entire, Round Robin, Random
  • In addition, there is a 9th technique known as 'Auto'.

Notes:

  • Partitioning techniques play an important role in performance tuning.
  • A key-based technique ensures that rows with the same key column value are collected in the same partition.

Example:

EMP (DNO = primary key)

ENO  EName  DNO
11   a      10
12   b      20
13   c      10
14   d      30
15   e      20

DNO  DName  Loc
10   ACE    Hyd
20   Meter  Sec
30   Sales  Eng

When these tables are combined using horizontal combining, rows with the same key column value are collected in the same partition.
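To make the key-based idea concrete, here is a small Python sketch using the two tables from the example. It assumes 3 nodes and a simple modulus scheme (key value modulo the node count). The names `partition_by_key`, `join_partition`, and `dept` are hypothetical, and this is an illustration of the concept, not DataStage's internal implementation.

```python
from collections import defaultdict

NODES = 3  # assumed number of nodes, chosen for illustration

# Rows taken from the example tables above.
emp = [(11, "a", 10), (12, "b", 20), (13, "c", 10), (14, "d", 30), (15, "e", 20)]
dept = [(10, "ACE", "Hyd"), (20, "Meter", "Sec"), (30, "Sales", "Eng")]


def partition_by_key(rows, key_index, nodes=NODES):
    """Distribute rows across nodes using the key value modulo the node count."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key_index] % nodes].append(row)
    return partitions


emp_parts = partition_by_key(emp, key_index=2)    # DNO is the third EMP column
dept_parts = partition_by_key(dept, key_index=0)  # DNO is the first column here


def join_partition(emp_rows, dept_rows):
    """Join EMP rows to department rows on DNO within a single partition."""
    dept_by_dno = {dno: (dname, loc) for dno, dname, loc in dept_rows}
    return [(eno, ename, dno, *dept_by_dno[dno]) for eno, ename, dno in emp_rows]


for node in range(NODES):
    print(node, join_partition(emp_parts[node], dept_parts[node]))
```

Because both tables are partitioned on DNO with the same rule, every employee row lands on the node that also holds its department row, so each node can join its own slice without looking at the others.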

Repartitioning

Data that has already been partitioned is partitioned once again.

Example:

EName  Dno  Loc
A      10   AP
B      20   TN
C      10   TN
D      20   KN
E      30   TN
F      10   KN
G      20   AP

  • Partitioning and repartitioning are automatic processes in DataStage.
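As a rough illustration of repartitioning, the Python sketch below first partitions the example rows on Dno and then redistributes them on Loc. The choice of those two keys is an assumption made for the example, as are the helper names `partition` and `repartition`. For simplicity each distinct key value gets its own partition, which a real job with a fixed node count would not do.

```python
from collections import defaultdict

# Rows taken from the example table above: (EName, Dno, Loc).
rows = [("A", 10, "AP"), ("B", 20, "TN"), ("C", 10, "TN"), ("D", 20, "KN"),
        ("E", 30, "TN"), ("F", 10, "KN"), ("G", 20, "AP")]


def partition(rows, key_index):
    """Group rows so that equal key values share a partition."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[key_index]].append(row)
    return dict(parts)


def repartition(parts, key_index):
    """Redistribute already partitioned rows on a different key column."""
    all_rows = [row for part in parts.values() for row in part]
    return partition(all_rows, key_index)


by_dno = partition(rows, key_index=1)      # first pass: keyed on Dno
by_loc = repartition(by_dno, key_index=2)  # second pass: keyed on Loc
print(sorted(by_dno))  # [10, 20, 30]
print(sorted(by_loc))  # ['AP', 'KN', 'TN']
```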

  Reverse Partitioning

  • Reverse partitioning collects the data from the nodes.
  • It happens in only one situation: going from parallel to sequential.
  • Reverse partitioning is also called collecting.

Different Collecting Methods

  1. Ordered
  2. Round Robin
  3. Sort – Merge
  4. Auto
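The first three collecting methods can be sketched in a few lines of Python. The partition contents and function names below are hypothetical. Auto is left out because it simply lets the engine choose a method.

```python
import heapq
from itertools import chain, zip_longest

# Hypothetical contents of three partitions, each already sorted.
partitions = [[1, 2, 9], [3, 4], [5, 6, 7]]


def collect_ordered(parts):
    """Read all rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(parts))


def collect_round_robin(parts):
    """Take one row from each partition in turn until all are exhausted."""
    missing = object()  # sentinel for partitions that run out early
    return [row for group in zip_longest(*parts, fillvalue=missing)
            for row in group if row is not missing]


def collect_sort_merge(parts):
    """Merge partitions that are each sorted into one sorted stream."""
    return list(heapq.merge(*parts))


print(collect_ordered(partitions))      # [1, 2, 9, 3, 4, 5, 6, 7]
print(collect_round_robin(partitions))  # [1, 3, 5, 2, 4, 6, 9, 7]
print(collect_sort_merge(partitions))   # [1, 2, 3, 4, 5, 6, 7, 9]
```

Note that sort-merge only yields a fully sorted result when each partition is itself sorted on the same key.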

Pipeline Parallelism

Extraction, transformation, and loading are performed simultaneously.

Pipe link

A pipe link is a channel through which data moves from one stage to another.
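The flow of rows through pipe links can be modelled with Python generators. This is only a sketch: the stage functions and the `events` log are hypothetical, and generators interleave the stages row by row in a single thread rather than running them concurrently as a parallel engine would. It does show the key point, which is that a row moves on to the next stage without waiting for the previous stage to finish all rows.

```python
events = []  # records the order in which stages touch each row


def extract(records):
    """Source stage: emit rows one at a time."""
    for row in records:
        events.append(("extract", row))
        yield row


def transform(rows):
    """Transform stage: process each row as soon as it arrives."""
    for row in rows:
        events.append(("transform", row))
        yield row.upper()


def load(rows):
    """Target stage: store each row as soon as it has been transformed."""
    target = []
    for row in rows:
        events.append(("load", row))
        target.append(row)
    return target


target = load(transform(extract(["a", "b"])))
print(target)  # ['A', 'B']
print(events)
```

The event log shows row "a" being extracted, transformed, and loaded before row "b" is extracted.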

Traditional Batch Processing (Server Jobs)

Server jobs use sequential processing.

Example: suppose we have 3 instructions:

  • I1: Fetch (F), Decode (D), Execute (E), Write back (W)
  • I2: F, D, E, W
  • I3: F, D, E, W

In sequential processing, each instruction completes all of its steps before the next one starts.

Parallel Processing

All transactions run in parallel, each one starting a step behind the previous one:

T1  T2  T3  T4  T5  T6  T7  T8
F   D   E   W
    F   D   E   W
        F   D   E   W
            F   D   E   W
                F   D   E   W
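The difference between the two approaches can be counted in time slots. Here is a minimal Python sketch, assuming every step takes exactly one slot and a new instruction can start in each slot. The function names are hypothetical.

```python
STEPS = ["F", "D", "E", "W"]  # fetch, decode, execute, write back


def sequential_cycles(instructions: int, steps: int = len(STEPS)) -> int:
    """Each instruction finishes all of its steps before the next starts."""
    return instructions * steps


def pipelined_cycles(instructions: int, steps: int = len(STEPS)) -> int:
    """Each instruction starts one time slot after the previous one."""
    if instructions == 0:
        return 0
    return steps + instructions - 1


def pipeline_chart(instructions: int) -> list:
    """Build one text row per instruction, shifted right by one slot each."""
    width = pipelined_cycles(instructions)
    rows = []
    for i in range(instructions):
        cells = ["."] * i + STEPS + ["."] * (width - i - len(STEPS))
        rows.append(" ".join(cells))
    return rows


print(sequential_cycles(5))  # 20
print(pipelined_cycles(5))   # 8
print("\n".join(pipeline_chart(5)))
```

For the five rows in the chart above, sequential processing needs 20 slots while the pipelined version finishes in 8 (T1 to T8).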

 

 The core differences between versions 7.5 * 2 and 8.0.1 of DataStage

7.5 * 2                                      8.0.1
4 client components:                         5 client components:
  1. DS Designer                               1. DS Designer
  2. DS Director                               2. DS Director
  3. DS Manager                                3. DS Admin
  4. DS Admin                                  4. Web Console
OS-dependent (OS users become                OS-independent (users can be created in
DataStage users)                             DataStage, but with one dependency)
File-based repository (folder)               Database repository (default is DB2)
No web-based administration                  Web-based administration
2 architecture components:                   5 architecture components:
  1. Server                                    1. Common user interface
  2. Client                                    2. Common repository
                                               3. Common engine
                                               4. Common connectivity
                                               5. Common shared services
Can perform phases 3 and 4                   Can perform phases 1, 2, 3, and 4
2-tier                                       N-tier

Notes:

  • The features of Manager in 7.5 * 2 are integrated into Designer in 8.0.1.
  • Users
    • In 7.5 * 2, the user IDs and passwords used to log in are created in the OS; OS users become DataStage users.
    • In 8.0.1, they are created in the DataStage environment.
  • Repository
    • In 7.5 * 2, everything is stored in a folder in the form of files.
    • In 8.0.1, data is organized in 2 layers:
      • Global repository → database → more security
      • Local repository → folder → performance
  • In 8.0.1, an admin can work remotely (for example, from home) using the Web Console component.
  • Tiers
    • In 7.5 * 2, the architecture is 2-tier: S → server, C → client machine.
    • In 8.0.1, we can have multiple servers/engines but only 1 repository (R = repository, C = client, E = engine): R → C1, C → C2, E1 → C3, E2 → C4, E3 → C4, ..., En → Cn. The n-tier components can be configured on n machines.

Client components of 7.5 * 2 and 8.0.1

 7.5*2 Designer  

  • Create jobs: Mainframe jobs (MF), Server jobs (SJ), Sequence jobs, Parallel jobs (PJ)
  • Compile
  • Run
  • Multiple job Compile

  Director

  • Views
    • Jobs
    • Status
    • Logs
  • Monitor
  • Batch Jobs
  • unlock jobs
  • Message Handling
  • Schedule jobs

Manager   

  • Import / Export
  • Node Configuration

Admin

  • Create projects
  • Delete projects
  • Organize projects

8.0.1 Designer  

  • Create jobs: Mainframe jobs (MF), Server jobs (SJ), Sequence jobs, Parallel jobs (PJ)
  • Compile
  • Run
  • Multiple job Compile
  • Import / Export
  • Node Configuration
  • Advanced Find
  • Performance Analysis
  • Estimate Resource

Director

  • Views
    • Jobs
    • Status
    • Logs
  • Monitor
  • Batch Jobs
  • Unlock jobs
  • Message Handling
  • Schedule jobs

Admin  

  • Create projects
  •  Delete projects
  • Organize projects

Web Console

  • Security  Services
  • Reporting Services
  • Logging Services
  • Scheduling Services
  • Domain Management
  • Session Management

Information Analyzer / Console for IBM Information Server

  • Data profiling: column analysis (CA), primary key analysis (PA), foreign key analysis (FA), baseline analysis, and cross-domain analysis


About Author

TekSlate